How do social scientists
answer questions using data?

PSCI 2270 - Week 4

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

September 24, 2024

Plan for this week



  1. Some math… LLN and CLT

  2. Logic of causal inference

Plan for this week


  1. Some math… LLN and CLT

Some building blocks


  • Probability:

    • Basis for understanding uncertainty in our estimates
    • E.g., how likely it is that our sample mean is a distance \(x\) from the population mean
  • Law of Large Numbers

    • As sample size increases (e.g., roll a die many times)
    • Average of the sample converges to the truth
  • Central Limit Theorem:

    • If we collect many (large) samples
    • Result follows the normal distribution

Random samples


  • Suppose we collect \(n\) observations: \(X_1\) , \(X_2\), … , \(X_n\)

    • \(X_1\) is the age of the first randomly selected registered voter.
    • \(X_2\) is the age of the second randomly selected registered voter, etc.
  • We then summarize these \(n\) observations by calculating a statistic, e.g. mean

    • All statistical procedures involve a statistic, very often sum or mean.
    • What are the properties of these sums and means?
    • Can the sample mean of age tell us anything about the population distribution of age?
  • How do we know how far this is from the summary of all \(N\) (or even infinitely many!) units in the population?

    • Key idea: Can we say something about distribution of statistics across (hypothetical) re-samples?
    • Key tool: study what happens as the sample size becomes very large (asymptotics)

Stats Lingo: LLN


Law of Large Numbers (LLN)

Let \(X_1\) , … , \(X_n\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then, \(\bar{X}_{n}\) converges to \(\mu\) as \(n\) gets large.


  • The probability of \(\bar{X}_n\) being “far away” from \(\mu\) goes to \(0\) as \(n\) gets big
  • Intuition: If we roll the six-sided die many times, what do you think the average of rolls will be?
  • Important result: The distribution of sample means “collapses” onto the population mean as the sample size grows
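The LLN can be illustrated with a small simulation (a Python sketch, not part of the slides; the seed and sample sizes are arbitrary): as the number of die rolls grows, the sample mean settles near the population mean of 3.5.

```python
import random
import statistics

random.seed(42)  # arbitrary seed for reproducibility

# LLN demo: the average of fair die rolls approaches the population
# mean (3.5) as the number of rolls grows.
for n in [10, 100, 10_000]:
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, round(statistics.mean(rolls), 3))
```

With \(n = 10{,}000\) the sample mean is typically within a few hundredths of 3.5, while small samples wander much further.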

Stats Lingo: CLT


Central Limit Theorem (CLT)

Let \(X_1\) , … , \(X_n\) be a sample from population with mean \(\mu\) and variance \(\sigma^2\). Then, \(\bar{X}_n\) (sample mean) will be approximately distributed \(N ( \mu, \sigma^2 / n )\) as \(n\) goes to infinity.


  • Intuition: Imagine you can collect many (large) samples and calculate a mean for each; the resulting means form a sampling distribution that has good properties

  • Important result: We now know how far away \(\bar{X}_n\) can be from population mean of \(X_i\) (not \(\bar{X}_n\))!
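The CLT can be checked by simulation (a Python sketch, not from the slides; sample size, number of re-samples, and seed are arbitrary choices): draw many samples of die rolls, keep each sample's mean, and compare the spread of those means to \(\sigma/\sqrt{n}\).

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed

n, reps = 100, 2_000              # size of each sample, number of re-samples
sigma = math.sqrt(35 / 12)        # SD of a single fair die roll

# Draw many samples and keep each sample's mean: the sampling distribution
means = [statistics.mean(random.randint(1, 6) for _ in range(n))
         for _ in range(reps)]

print(round(statistics.mean(means), 3))   # should be near mu = 3.5
print(round(statistics.stdev(means), 3))  # should be near sigma/sqrt(n) ≈ 0.171
```

A histogram of `means` would look bell-shaped even though a single die roll is uniform, which is exactly the CLT's claim.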

Normal Distribution

  • Distribution: possible value of \(X\) \(\rightarrow\) probability of \(X\) taking this value
  • The Normal distribution is one of the most ubiquitous distributions in statistics

    • Follows “bell-shaped” curve
    • Mean and variance are two key characteristics (parameters) and are given in parentheses
    • When \(X\) is distributed normally, we write \(X \sim N ( \mu, \sigma^2 )\)
  • Three key properties:

    • Unimodal: one peak at the mean
    • Symmetric around the mean
    • Everywhere positive: any real value can possibly occur
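One Normal-distribution fact used repeatedly below is that about 95% of draws fall within 1.96 standard deviations of the mean. A quick check by simulation (a Python sketch, not from the slides; seed and draw count are arbitrary):

```python
import random

random.seed(1)  # arbitrary seed

mu, sigma = 0.0, 1.0
draws = [random.gauss(mu, sigma) for _ in range(100_000)]

# Fraction of draws within 1.96 standard deviations of the mean
within = sum(abs(x - mu) <= 1.96 * sigma for x in draws) / len(draws)
print(round(within, 3))   # close to 0.95
```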

Implications of CLT/LLN


  • By CLT, sample mean is a draw from Normal distribution with mean \(\mu\) and variance of \(\sigma^2 / n\)
  • Using properties of the Normal distribution, the sample mean will be within \(2 \times \sigma / \sqrt{n}\) of the population mean 95% of the time
  • We usually only have one sample, so we’ll only get one sample mean. So why do we care about LLN/CLT?

    • CLT gives us assurances our sample mean won’t be too far from population mean
    • CLT also helps us create measure of uncertainty for our estimates: standard deviation of the sampling distribution, or standard error (SE):

    \[ SE = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]

Putting the Concepts to Work

  • Question: What proportion of the public approves of Biden’s job as president?
  • Latest Gallup poll:

    • September 3-15
    • 1007 adult Americans
    • Telephone interviews using RDD
    • Approve (39%), Disapprove (58%)
    • Devil in the Details
  • What can we learn about Biden’s approval in the population from this one sample?

Samples from Population



  • Our data: simple random sample of size \(n\) from some population \(X_1\) , … , \(X_n\)

    • Each individual is independently drawn \(\Rightarrow\) \(i.i.d.\) random variables
    • \(X_i = 1\) if \(i\) approves of Biden, \(X_i = 0\) otherwise
  • We will use data to guess something about the population distribution of \(X_i\) \(\Leftarrow\) statistical inference

Point Estimation


  • Point estimation: providing a single “best guess” as to the value of some fixed, unknown quantity of interest, \(\theta\) (read theta)

    • \(\theta\) is a feature of the population distribution
    • Also called parameters
  • Examples of quantities of interest ( estimands ):

    • \(\mu = \mathbb{E} [ X_i ]\): the population mean (e.g., the approval rate in the population)
    • \(\sigma^2 = \mathrm{Var}[ X_i ]\): the population variance
    • \(\mu_1 - \mu_0 = \mathbb{E} [ X_i (1) ] - \mathbb{E} [ X_i (0) ]\): the population Average Treatment Effect (ATE)
  • These are the things we want to learn about

Estimators


Estimator

An estimator, \(\hat{\theta}\), of some parameter \(\theta\), is a statistic: \(\hat{\theta} = h(X_1 , ... , X_n )\).

  • An estimate is one particular realization of the estimator

    • Ideally we’d like to know the estimation error \(\hat\theta - \theta\) (its average across repeated samples is the bias)
    • Problem: \(\theta\) is unknown
    • Solution: figure out the properties of \(\hat{\theta}\) using probability
  • \(\hat{\theta}\) is a random variable because it is a function of sequence of random draws \(\Rightarrow\) CLT and LLN apply!

Estimating Biden’s support


  • Parameter \(\theta\): population proportion of adults who support Biden
  • There are many (\(\infty\)) different possible estimators:

    • \(\hat{\theta} = \bar{X}_n\) : the sample proportion of respondents who support Biden
    • \(\hat{\theta} = X_1\) : just use the first observation
    • \(\hat{\theta} = \max( X_1 , ... , X_n )\) : pick the maximum of all observations
    • \(\hat{\theta} = 0.5\) : always guess 50% support
  • How good are these different estimators?

  • We usually rely on the mean, partly because the LLN and CLT apply to it
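These estimators can be compared by simulation. A Python sketch (not from the slides; the true support \(p = 0.39\), sample size, and seed are illustrative assumptions): draw many samples and average each estimator across re-samples; an unbiased estimator averages out to \(p\).

```python
import random
import statistics

random.seed(2)  # arbitrary seed

p, n, reps = 0.39, 100, 2_000   # assumed true support, sample size, re-samples

averages = {"sample mean": [], "first obs": [], "maximum": [], "always 0.5": []}
for _ in range(reps):
    x = [1 if random.random() < p else 0 for _ in range(n)]
    averages["sample mean"].append(statistics.mean(x))
    averages["first obs"].append(x[0])
    averages["maximum"].append(max(x))
    averages["always 0.5"].append(0.5)

# Average of each estimator across re-samples ≈ its expectation
for name, vals in averages.items():
    print(f"{name}: {statistics.mean(vals):.3f}")
```

The sample mean and the first observation are both centered on \(p\) (the first observation is just far noisier), while the maximum and the constant 0.5 are badly biased.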

Sample mean properties



\[ \underbrace{\text{estimate}}_{\text{sample mean, }\bar{X}} = \underbrace{\text{estimand}}_{\text{population mean, }p} + \text{noise} \]

  • Remember: the sample mean is a random variable

    • Different samples give different sample means
    • Noise “bumps” the sample mean away from the population mean; we assume away bias
    • \(\bar{X}\) has a distribution across repeated samples (the sampling distribution) that we know has nice properties

Central tendency of the sample mean



  • Expectation: average of the estimates across repeated samples

    • By linearity of expectation: \(\mathbb{E}[\bar{X}] = \mathbb{E}[ X_i ] = p\)
    • \(\rightarrow\) noise is \(0\) on average: \[\mathbb{E}[\bar{X} − p] = \mathbb{E}[\bar{X}] − p = 0\]
  • UnBIASedness: Sample proportion is on average equal to the population proportion

Spread of the Sample Mean


  • Standard error: how big is the noise on average?
  • We can use a special rule for binary random variables to calculate the standard deviation (SD):

\[\sqrt{\mathrm{Var}(\bar{X})} = \sqrt{\frac{p(1 − p)}{n}}\]

  • Problem: we don’t know \(p\)!
  • Solution: estimate the SE using sample mean

\[\sqrt{\widehat{\mathrm{Var}}(\bar{X})} = \sqrt{\frac{\bar{X}(1 − \bar{X})}{n}} = \sqrt{\frac{0.39 (1 − 0.39)}{1007}} \approx 0.0153\]
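The slide's standard-error calculation for the Gallup numbers, as a one-line Python sketch:

```python
import math

x_bar, n = 0.39, 1007   # Gallup numbers from the slide
se = math.sqrt(x_bar * (1 - x_bar) / n)
print(round(se, 4))     # roughly 0.015, i.e. about 1.5 percentage points
```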

Confidence Intervals



  • Awesome: Sample proportion is correct on average; standard error estimates how variable sample proportion is across (hypothetical) samples
  • Awesome-r: Get a range of plausible values
  • Confidence interval: way to construct an interval that will contain the true value in some fixed proportion of repeated samples

Using CLT

\[ \bar{X} − p = \text{noise}\]

  • How can we figure out a range of plausible noise?

    • Find a range of plausible noise values and add them to \(\bar{X}\)
  • Central Limit Theorem:

\[\bar{X} \sim N \left( \underbrace{\mathbb{E}[X_i]}_{p}, \underbrace{\frac{\mathrm{Var}(X_i)}{n}}_{\frac{p(1-p)}{n}} \right)\]

  • Noise: \(\bar{X} − p\) is approximately normal with mean 0 and SD equal to \(\sqrt{\frac{p(1-p)}{n}}\)

Confidence interval



  • First, choose a confidence level.

    • What percent of noise do you want to count as “plausible”?
    • Convention is \(95\%\).
  • \(100 \times (1 − \alpha)\) % confidence interval: \(CI = \bar{X} \pm z_{\alpha/2} \times SE\)

    • In polling, \(\pm z_{\alpha/2} × SE\) is called the margin of error

CIs for the Gallup Poll


  • Gallup poll: \(\bar{X} = 0.39\) with an SE of \(0.0153\)
  • 90% CI: \[[0.39 − 1.64 × 0.0153, 0.39 + 1.64 × 0.0153] = [0.364, 0.415]\]
  • 95% CI: \[[0.39 − 1.96 × 0.0153, 0.39 + 1.96 × 0.0153] = [0.360, 0.420]\]
  • 99% CI: \[[0.39 − 2.58 × 0.0153, 0.39 + 2.58 × 0.0153] = [0.351, 0.429]\]
  • More confidence \(\rightarrow\) wider intervals
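The three intervals above can be reproduced with a short Python sketch (the z-values are the conventional ones from the slide):

```python
import math

x_bar, n = 0.39, 1007   # Gallup numbers from the slide
se = math.sqrt(x_bar * (1 - x_bar) / n)

# z-values for common confidence levels
for level, z in [(90, 1.64), (95, 1.96), (99, 2.58)]:
    lo, hi = x_bar - z * se, x_bar + z * se
    print(f"{level}% CI: [{lo:.3f}, {hi:.3f}]")
```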

95% CI’s


Summary


  • We know that sample mean is unbiased estimate of population mean \(\Rightarrow\) use it as our (point) estimate of population mean
  • According to CLT the sample means are normally distributed with variance of \(\sigma^2 / n\) \(\Rightarrow\) use sample mean and sample size to calculate estimate of uncertainty (sampling variance)
  • Come up with a confidence level (usually 95%) and use the mean and variance estimates to see where the population mean would likely fall if we were to re-sample many times \(\Rightarrow\) Calculate a range of ~2 standard errors (square root of the sampling variance) around the estimated population mean
  • 95% confidence interval includes population mean in 95% of re-samples
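The coverage claim in the last bullet can be verified by simulation (a Python sketch, not from the slides; the true \(p = 0.39\), sample size, and seed are illustrative assumptions): repeatedly draw a sample, build a 95% CI, and count how often the interval contains the true \(p\).

```python
import math
import random
import statistics

random.seed(3)  # arbitrary seed

p, n, reps = 0.39, 1007, 1_000   # assumed truth, poll size, number of re-samples
covered = 0
for _ in range(reps):
    x_bar = statistics.mean(1 if random.random() < p else 0 for _ in range(n))
    se = math.sqrt(x_bar * (1 - x_bar) / n)
    covered += (x_bar - 1.96 * se <= p <= x_bar + 1.96 * se)
print(covered / reps)   # close to 0.95
```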

Plan for this week


  1. Some math… LLN and CLT
  2. Logic of causal inference

External vs Internal Validity



  • So far we focused on ability to study population using just a sample \(\Rightarrow\) external validity
  • This is important if all relevant outcomes within sample are observed

    • e.g. sample mean, median, polarization, etc.
  • But, for causal (“what-if”) question we cannot observe all relevant outcomes within sample \(\Rightarrow\) internal validity

    • Why?

Fundamental Problem of Causal Inference


  • Factual vs. Counterfactual
  • Does the minimum wage increase the unemployment rate?

    • Factual: Unemployment rate in US went up after the minimum wage increased
    • Counterfactual: Would it have gone up if the minimum wage increase had not occurred?
  • Does having a daughter affect a judge’s rulings in court?

    • Factual: A judge with a daughter gave a pro-choice ruling
    • Counterfactual: Would they have done that if they had a son instead?
  • Fundamental problem of causal inference:

    • We can never observe counterfactuals \(\Leftarrow\) must be inferred or assumed

Fundamental Problem in Movies


Example



  • Question: Does having a female head of a village council increase the share of the budget allocated to water sanitation?

  • Setting: 8 randomly sampled villages in Indonesia (some with female and some with male head)

  • Outcome: Share of budget each village spends on water sanitation

Compare Two Villages



Village     Head of Council   Budget Share
Village 1   Female            15%
Village 2   Male              10%



  • Question: Did the first village have larger share spent on water sanitation because the head of the council was female?
  • Concern: What if there are other differences between villages or the budget share leads to female council representation?

Experimental Lingo


  • Treatment/intervention, \(T_{i}\): Who is head of council in village \(i\) (what we previously called the independent variable)
  • Treatment (\(T_i = 1\)) group: Villages with female head of council

  • Control (\(T_i = 0\)) group: Villages with male head of council

  • Outcome variable, \(Y_i\): Share of spending


Village     \(T_i\) (Head of Council)   \(Y_i\) (Budget Share)
Village 1   1                           15
Village 2   0                           10

Potential Outcomes


  • What does “\(T_i\) causes \(Y_i\)” mean?

    • Would a village with female and male head have different budget allocations?
  • Imagine two states of the world: one in which you receive some treatment and another in which you do not \(\Rightarrow\) potential outcomes

    • Treated, \(Y_i (1)\): spending on water sanitation if village \(i\) had a female head?
    • Untreated/Control, \(Y_i (0)\): spending on water sanitation if village \(i\) had a male head?

Treatment Effect(s)


  • (Individual) Treatment effect: \(Y_i (1) − Y_i (0)\)

    • \(Y_i (1) − Y_i (0) = 0\): gender of village head has no effect on spending on water sanitation
    • \(Y_i (1) − Y_i (0) < 0\): female village head has negative effect on spending on water sanitation
    • \(Y_i (1) − Y_i (0) > 0\): female village head has positive effect on spending on water sanitation
  • Average Treatment Effect (ATE):

    \[ \frac{1}{n} \sum_{i = 1}^{n} Y_i (1) − \frac{1}{n} \sum_{i = 1}^{n} Y_i (0) = \frac{1}{n} \sum_{i = 1}^{n} \left[ Y_i (1) − Y_i (0) \right] \]

    • Difference between average treated and untreated potential outcomes
    • Does having a female head of village lead to an increase in spending on water sanitation on average?
    • Note: You might come close to observing this treatment effect if you could observe \(Y_i (0)\) the instant before an intervention and \(Y_i (1)\) the instant after, but strictly speaking, you are not observing them at the same time \(\Rightarrow\) Focus on ATE
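To make the definitions concrete, here is a Python sketch with a made-up, fully observed schedule of potential outcomes (the numbers are hypothetical, not from the slides; in real data only one column per village is ever observed):

```python
# Hypothetical complete schedule of potential outcomes for 4 villages
y1 = [16, 12, 20, 15]   # Y_i(1): budget share (%) with a female head
y0 = [11, 10, 18, 15]   # Y_i(0): budget share (%) with a male head

ite = [a - b for a, b in zip(y1, y0)]   # individual treatment effects
ate = sum(ite) / len(ite)               # their average = the ATE
print(ite, ate)                         # [5, 2, 2, 0] 2.25
```

Here the ATE is 2.25 percentage points even though individual effects range from 0 to 5, which is why the average is the estimand of choice.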

Back to Fundamental Problem


Village \(T_i\) (Head of Council) \(Y_i\) (Budget Share) \(Y_i (0)\) (Budget Share if Male Head) \(Y_i (1)\) (Budget Share if Female Head)
Village 1   1   15   ???   15
Village 2   0   10   10    ???


  • What is your best guess about treatment effect?
  • Fundamental problem of causal inference:

    • We only observe one of the two potential outcomes.
    • In terms of potential outcomes: observe \(Y_i = Y_i (1)\) if \(T_i = 1\) or \(Y_i = Y_i (0)\) if \(T_i = 0\)
  • To infer causal effect, we need to infer the missing counterfactuals!
  • Can we assume potential outcomes are the same?

Matching?



  • Find a similar unit! \(\Rightarrow\) matching

    • Mill’s method of difference
  • Did village spend more on water sanitation because of female council head?

    • \(\rightarrow\) find a village that has male council head but very similar otherwise
  • NJ increased the minimum wage. Causal effect on unemployment?

    • \(\rightarrow\) find a state similar to NJ that didn’t increase minimum wage

Imperfect matches

  • The problem: imperfect matches!

  • Say we match villages \(i\) (treated) and \(j\) (control)

  • Selection Bias: \(Y_i (1) \neq Y_j (1)\) or \(Y_i (0) \neq Y_j (0)\)

  • Those who take treatment may be different from those who take control

  • How can we correct for that?

RANDOMIZE! 😵‍💫

(Social) Science Squad

Why Does it Work?



  • Fundamental problem of causal inference still prevents us from estimating individual treatment effects, BUT…
  • Random assignment enables us to create two groups whose treated and untreated potential outcomes are the same in expectation
  • The treatment group provides us with a random sample of \(Y_i (1)\), and the control group provides us with a random sample of \(Y_i (0)\)
  • The difference-in-means estimator compares average outcomes between two samples: treatment and control group

\[ \text{Difference-in-means} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{untreated}} \]

Core Assumptions

  1. Random assignment of subjects to groups: Implies that receiving the treatment is statistically independent of subjects’ potential outcomes
  2. Non-interference: A subject’s potential outcomes reflect only whether they receive the treatment themselves

    • A subject’s potential outcomes are unaffected by how the treatments happened to be allocated
  3. Excludability: A subject’s potential outcomes respond only to the defined treatment, not other extraneous factors that may be correlated with treatment

    • Importance of defining the treatment precisely and maintaining symmetry between treatment and control groups (e.g., through blinding)

Difference-in-means is Unbiased

Under assumptions 1-3, the Difference-in-Means estimator produces unbiased estimates of the Average Treatment Effect (ATE). In other words, averaging the Difference-in-Means estimate over many hypothetical trials (within the same sample) yields the Average Treatment Effect.
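Unbiasedness under random assignment can be checked by simulation (a Python sketch with a made-up complete schedule of potential outcomes, not from the slides): fix the schedule, re-randomize the treatment many times, and compare the average difference-in-means to the true ATE.

```python
import random
import statistics

random.seed(4)  # arbitrary seed

# Hypothetical complete schedule of potential outcomes for 8 units
y0 = [10, 10, 20, 12, 10, 14, 25, 10]
y1 = [15, 12, 22, 15, 11, 15, 30, 10]
true_ate = statistics.mean(a - b for a, b in zip(y1, y0))

reps = 5_000
units = list(range(8))
estimates = []
for _ in range(reps):
    treated = set(random.sample(units, 4))    # randomly treat 4 of 8 units
    t_mean = statistics.mean(y1[i] for i in treated)
    c_mean = statistics.mean(y0[i] for i in units if i not in treated)
    estimates.append(t_mean - c_mean)

# The average estimate across random assignments should be close to the ATE
print(true_ate, round(statistics.mean(estimates), 2))
```

Any single assignment can miss the ATE badly; it is the average over (hypothetical) repeated randomizations that is right on target.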

Observational Methods



  • All observational methods in statistics are trying to approximate randomization under assumptions!
  • Regression with covariates: Block possible known (?) confounders

  • Matching: Use known (?) covariates to match units and then compare within matched sets

  • Event study: Use before and after treatment comparison within the same unit

  • Regression Discontinuity: Use some naturally occurring discontinuity (e.g. tax exemption threshold, borders, etc.) to compare units around it

  • Differences-in-Differences: Compare trends between treated and untreated units even if we know there might be differences between them

  • Many others!

Back to Example


Village \(T_i\) (Head of Council) \(Y_i\) (Budget Share) \(Y_i (0)\) (Budget Share if Male Head) \(Y_i (1)\) (Budget Share if Female Head)
Village 1 1 15 ??? 15
Village 2 0 10 10 ???
Village 3 0 20 20 ???
Village 4 1 15 ??? 15
Village 5 0 10 10 ???
Village 6 1 15 ??? 15
Village 7 1 30 ??? 30
Village 8 0 10 10 ???
  • What do we substitute for the ???s? We need an educated guess …
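Using only the observed column of the table above, the difference-in-means estimate is straightforward (a Python sketch of the calculation):

```python
import statistics

# Observed data from the table: (T_i, Y_i) for the 8 villages
data = [(1, 15), (0, 10), (0, 20), (1, 15), (0, 10), (1, 15), (1, 30), (0, 10)]

treated = [y for t, y in data if t == 1]
control = [y for t, y in data if t == 0]

diff_in_means = statistics.mean(treated) - statistics.mean(control)
print(diff_in_means)   # 18.75 - 12.5 = 6.25
```

So the treatment group averages 18.75% and the control group 12.5%, giving an estimated effect of 6.25 percentage points.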

Two applications: Uncertainty

  • Task: Estimate uncertainty of your difference-in-means estimate
  • Steps:

    • Assume the true treatment effect is your difference in means \(\Rightarrow\) reconstruct schedule of potential outcomes
    • We can use approach similar to re-sampling for sampling mean
    • Re-shuffle the treatment; for each unit take the potential outcome corresponding to the new treatment and see what differences-in-means you could get
  • Can also be done using parametric approach analogous to sample mean

\[ SE(\bar{Y}_{\text{treated}} - \bar{Y}_{\text{untreated}}) = \sqrt{\frac{\sigma_{\text{treated}}^2}{n_{\text{treated}}} + \frac{\sigma_{\text{untreated}}^2}{n_{\text{untreated}}}} \]
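The parametric SE formula above, applied to the eight-village example (a Python sketch; note `statistics.variance` uses the \(n-1\) sample variance):

```python
import math
import statistics

# Outcomes from the eight-village table, split by treatment status
treated = [15, 15, 15, 30]
control = [10, 20, 10, 10]

# Sample variances within each group (n - 1 denominator)
var_t = statistics.variance(treated)
var_c = statistics.variance(control)

se = math.sqrt(var_t / len(treated) + var_c / len(control))
print(round(se, 2))
```

With groups this small the SE is large relative to the estimate of 6.25, which is why uncertainty measures matter.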

Two applications: Hypothesis testing



  • Hypothesis testing: There could be substantively interesting other treatment effects that we want to compare to what we observe

    • Each individual treatment effect is \(0\)
    • Average treatment effect is \(0\)
    • etc.
  • We call these values of the quantity of interest hypotheses and can check how likely we are to observe our result under those hypotheses in order to test them

    • Reconstruct schedule of potential outcomes assuming hypothesis is true (e.g. individual treatment effect is 0)
    • Re-shuffle the treatment; for each unit take the potential outcome corresponding to the new treatment and see what differences-in-means you could get
    • See how likely you are to observe a value at least as extreme as yours \(\Rightarrow\) \(p\)-value
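The steps above can be sketched for the eight-village example under the sharp null that every individual effect is 0 (so each village's outcome is unchanged under any assignment). With 8 units and 4 treated there are only \(\binom{8}{4} = 70\) possible assignments, so the whole null distribution can be enumerated:

```python
import itertools
import statistics

# Observed villages: (T_i, Y_i)
data = [(1, 15), (0, 10), (0, 20), (1, 15), (0, 10), (1, 15), (1, 30), (0, 10)]
outcomes = [y for _, y in data]

def diff_in_means(treated_idx):
    t = [outcomes[i] for i in treated_idx]
    c = [outcomes[i] for i in range(8) if i not in treated_idx]
    return statistics.mean(t) - statistics.mean(c)

observed = diff_in_means({0, 3, 5, 6})   # the actual assignment

# Sharp null: outcomes are fixed, so re-shuffle the treatment over all
# 70 possible assignments of 4 treated villages
null_dist = [diff_in_means(set(combo))
             for combo in itertools.combinations(range(8), 4)]

# Two-sided p-value: share of assignments at least as extreme as observed
p_value = sum(abs(d) >= abs(observed) for d in null_dist) / len(null_dist)
print(observed, round(p_value, 3))
```

A large \(p\)-value here would mean a difference as big as the observed one arises often by chance alone under the null.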
